Pb-Hash: Partitioned b-bit Hashing
Many hashing algorithms including minwise hashing (MinHash), one permutation
hashing (OPH), and consistent weighted sampling (CWS) generate integers of $B$
bits. With $k$ hashes for each data vector, the storage would be $B \times k$
bits; and when used for large-scale learning, the model size would be
$2^B \times k$, which can be expensive. A standard strategy is to use only the
lowest $b$ bits out of the $B$ bits and somewhat increase $k$, the number of
hashes. In this study, we propose to re-use the hashes by partitioning the $B$
bits into $m$ chunks, e.g., $b \times m = B$. Correspondingly, the model size
becomes $m \times 2^b \times k$, which can be substantially smaller than the
original $2^B \times k$.
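The chunking idea and the model-size arithmetic can be illustrated with a minimal sketch. It assumes the number of chunks divides the hash width evenly and that chunks are taken from the lowest bits upward; the function names `partition_hash` and `model_size`, and the parameter names `B` (bits per hash), `m` (chunks) and `k` (hashes), are illustrative choices rather than the authors' code.

```python
def partition_hash(h: int, B: int, m: int) -> list[int]:
    """Split a B-bit hash value into m chunks of B // m bits, lowest chunk first."""
    if B % m != 0:
        raise ValueError("this sketch assumes m divides B")
    b = B // m
    mask = (1 << b) - 1  # keeps the lowest b bits
    return [(h >> (i * b)) & mask for i in range(m)]


def model_size(B: int, m: int, k: int) -> int:
    """Parameter count with m tables of 2^(B/m) entries for each of k hashes."""
    b = B // m
    return m * (2 ** b) * k


# A 16-bit hash split into four 4-bit chunks.
chunks = partition_hash(0xABCD, B=16, m=4)

# Model size with and without partitioning, for k = 100 hashes.
full = model_size(B=16, m=1, k=100)
partitioned = model_size(B=16, m=4, k=100)
```

With these toy numbers each hash needs 4 tables of 16 entries instead of one table of 65536 entries, which is the saving the abstract describes.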
Our theoretical analysis reveals that by partitioning the hash values into $m$
chunks, the accuracy would drop. In other words, using $m$ chunks of $B/m$
bits would not be as accurate as directly using all $B$ bits of the hash. This
is due to the correlation from re-using the same hash. On the other hand, our
analysis also shows that the accuracy would not drop much for, e.g.,
$m = 2 \sim 4$. In some regions, Pb-Hash still works well even for $m$ much
larger than 4. We expect
Pb-Hash would be a good addition to the family of hashing methods/applications
and benefit industrial practitioners.
We verify the effectiveness of Pb-Hash in machine learning tasks, for linear
SVM models as well as deep learning models. Since the hashed data are
essentially categorical (ID) features, we follow the standard practice of using
embedding tables for each hash. With Pb-Hash, we need to design an effective
strategy to combine the per-chunk embeddings. Our study provides an empirical
evaluation of four pooling schemes: concatenation, max pooling, mean pooling,
and product pooling. There is no definitive answer as to which pooling scheme
is always better, and we leave that for future study.
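The four pooling schemes named above can be sketched as follows. The abstract does not define them precisely, so this assumes the usual element-wise reading: each chunk yields an embedding of the same length, and max, mean and product are taken coordinate by coordinate. The function name `pool` and the scheme labels are hypothetical.

```python
from math import prod


def pool(embeddings: list[list[float]], scheme: str) -> list[float]:
    """Combine several equal-length chunk embeddings into one vector."""
    if scheme == "concat":
        # Concatenation keeps every coordinate, so the output is longer.
        return [x for e in embeddings for x in e]
    columns = list(zip(*embeddings))  # one tuple of values per coordinate
    if scheme == "max":
        return [max(c) for c in columns]
    if scheme == "mean":
        return [sum(c) / len(c) for c in columns]
    if scheme == "product":
        return [prod(c) for c in columns]
    raise ValueError(f"unknown scheme: {scheme}")


# Two chunk embeddings of dimension 2.
e = [[1.0, 2.0], [3.0, 0.5]]
pooled = {s: pool(e, s) for s in ("concat", "max", "mean", "product")}
```

Concatenation grows the output dimension with the number of chunks, while the other three keep it fixed, which is one practical difference between the schemes.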
Constrained Approximate Similarity Search on Proximity Graph
Search engines and recommendation systems are built to efficiently display
relevant information from massive numbers of candidates. Typically, a
three-stage mechanism is employed in these systems: (i) a small collection of
items is first retrieved by, e.g., approximate near neighbor search
algorithms; (ii) a collection of constraints is then applied to the retrieved
items; (iii) a fine-grained ranking neural network is employed to determine the
final recommendation. We observe a major defect of the original three-stage
pipeline: although we only aim to retrieve $k$ vectors in the final
recommendation, we have to preset a sufficiently large $s$ ($s > k$) for each
query, and "hope" that the number of vectors surviving the filtering is not
smaller than $k$. That is, at least $k$ vectors among the $s$ similar
candidates need to satisfy the query constraints.
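The shortfall described above can be reproduced with a toy retrieve-then-filter sketch. It uses exact brute-force search in place of an approximate index, and the item layout (a dict with `id`, `vec` and `color`), the function name `retrieve_then_filter`, and the parameters `k` (results wanted) and `s` (preset candidate count) are assumptions made for illustration only; this is the baseline pipeline, not AIRSHIP.

```python
def retrieve_then_filter(query, items, constraint, k, s):
    """Stage (i): take the s nearest items; stage (ii): filter; keep up to k."""
    def sq_dist(vec):
        return sum((a - b) ** 2 for a, b in zip(query, vec))

    candidates = sorted(items, key=lambda it: sq_dist(it["vec"]))[:s]
    survivors = [it for it in candidates if constraint(it)]
    return survivors[:k]


# Ten 1-D items; only those far from the query satisfy the constraint.
items = [{"id": i, "vec": [float(i)], "color": "red" if i >= 6 else "blue"}
         for i in range(10)]
is_red = lambda it: it["color"] == "red"

too_small = retrieve_then_filter([0.0], items, is_red, k=2, s=4)
large_enough = retrieve_then_filter([0.0], items, is_red, k=2, s=8)
```

With `s=4` every candidate is filtered out and nothing is returned, while `s=8` recovers the two wanted items. The right `s` depends on how selective the constraint is, which is not known in advance.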
In this paper, we investigate this constrained similarity search problem and
attempt to merge the similarity search stage and the filtering stage into one
single search operation. We introduce AIRSHIP, a system that integrates
user-defined filtering functions into the similarity search framework. The
proposed system neither needs to build extra indices nor requires prior
knowledge of the query constraints. We propose three optimization strategies:
(1) starting point selection, (2) multi-direction search, and (3) biased
priority queue selection. Experimental evaluations on both synthetic and real
data confirm the effectiveness of the proposed AIRSHIP algorithm. We focus on
constrained graph-based approximate near neighbor (ANN) search in this study,
in part because graph-based ANN is known to achieve excellent performance. We
believe it is also possible to develop constrained hashing-based ANN or
constrained quantization-based ANN.
Anti-Helicobacter pylori activity of steroidal alkaloids obtained from three Veratrum plants
Anti-Helicobacter pylori (HP) activities of three total alkaloid fractions and fourteen steroidal alkaloids obtained from three Veratrum plants (V. maackii, V. nigrum var. ussuriense and V. patulum) were examined by the disc method. These plants are used as the Chinese crude drug "Li-lu" (藜蘆) to treat aphasia arising from apoplexy, wind type dysentery, jaundice, headache, scabies, chronic malaria, etc. Among the steroidal alkaloids tested, verapatulin (12) and veratramine (13) showed anti-HP activity. The disc minimum inhibitory concentration (disk-MIC) of 12 against two standard HP strains, NCTC11637 and NCTC11916, was 10 μg/ml, which is higher (i.e., weaker) than that of the clinically used antibiotic erythromycin (≤0.013 μg/ml) but comparable to that of penicillin G (3.1 μg/ml and 1.6 μg/ml, respectively).
Asymmetric Hashing for Fast Ranking via Neural Network Measures
Fast item ranking is an important task in recommender systems. In previous
works, graph-based Approximate Nearest Neighbor (ANN) approaches have
demonstrated good performance on item ranking tasks with generic
searching/matching measures (including complex measures such as neural network
measures). However, since these ANN approaches must go through the neural
measures several times during ranking, the computation is not practical if the
neural measure is a large network. On the other hand, fast item ranking using
existing hashing-based approaches, such as Locality Sensitive Hashing (LSH),
only works with a limited set of measures. Previous learning-to-hash approaches
are also not suitable to solve the fast item ranking problem since they can
take a significant amount of time and computation to train the hash functions.
Hashing approaches, however, are attractive because they provide a principled
and efficient way to retrieve candidate items. In this paper, we propose a
simple and effective learning-to-hash approach for the fast item ranking
problem that can be used for any type of measure, including neural network
measures. Specifically, we solve this problem with an asymmetric hashing
framework based on discrete inner product fitting. We learn a pair of related
hash functions that map heterogeneous objects (e.g., users and items) into a
common discrete space where the inner product of their binary codes reveals
their true similarity defined via the original searching measure. The fast
ranking problem is reduced to an ANN search via this asymmetric hashing scheme.
Then, we propose a sampling strategy to efficiently select relevant and
contrastive samples to train the hashing model. We empirically validate the
proposed method against the existing state-of-the-art fast item ranking methods
in several combinations of non-linear searching functions and prominent
datasets.
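Ranking by the inner product of binary codes can be illustrated with a small sketch. It assumes codes with entries in {-1, +1} that have already been learned, and it does not show how the two hash functions are trained; the names `inner_product` and `rank_items` and the toy codes are hypothetical. For such codes of length r, the inner product equals r minus twice the Hamming distance, so ranking by inner product is the same as ranking by Hamming distance.

```python
def inner_product(u, v):
    """Inner product of two equal-length codes with entries in {-1, +1}."""
    return sum(a * b for a, b in zip(u, v))


def rank_items(user_code, item_codes, top_n):
    """Return the names of the top_n items with the largest inner product."""
    scored = sorted(item_codes.items(),
                    key=lambda kv: inner_product(user_code, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:top_n]]


# Toy 4-bit codes: one user code and four item codes.
user = [1, -1, 1, 1]
items = {
    "a": [1, -1, 1, 1],     # inner product 4
    "b": [1, 1, 1, 1],      # inner product 2
    "c": [-1, 1, -1, -1],   # inner product -4
    "d": [-1, 1, 1, 1],     # inner product 0
}
top2 = rank_items(user, items, top_n=2)
```

The scan here is exhaustive for clarity; in the setting the abstract describes, an ANN search over the codes would replace it.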
Turn Fake into Real: Adversarial Head Turn Attacks Against Deepfake Detection
Malicious use of deepfakes leads to serious public concerns and reduces
people's trust in digital media. Although effective deepfake detectors have
been proposed, they are substantially vulnerable to adversarial attacks. To
evaluate the detector's robustness, recent studies have explored various
attacks. However, all existing attacks are limited to 2D image perturbations,
which are hard to translate into real-world facial changes. In this paper, we
propose adversarial head turn (AdvHeat), the first attempt at 3D adversarial
face views against deepfake detectors, based on face view synthesis from a
single-view fake image. Extensive experiments validate the vulnerability of
various detectors to AdvHeat in realistic, black-box scenarios. For example,
AdvHeat based on a simple random search yields a high attack success rate of
96.8% with 360 searching steps. When additional query access is allowed, we can
further reduce the step budget to 50. Additional analyses demonstrate that
AdvHeat is better than conventional attacks on both the cross-detector
transferability and robustness to defenses. The adversarial images generated by
AdvHeat are also shown to look natural. Our code, including that for
generating a multi-view dataset consisting of 360 synthetic views for each of
1000 IDs from FaceForensics++, is available at
https://github.com/twowwj/AdvHeaT